    Finite precision deep learning with theoretical guarantees

    Recent successes of deep learning have been achieved at the expense of very high computational and parameter complexity. Today, both inference and training of deep neural networks (DNNs) are deployed predominantly in the cloud. A recent alternative trend is to deploy DNNs onto untethered, resource-constrained platforms at the Edge. To realize on-device intelligence, the gap between algorithmic requirements and available resources needs to be closed. One popular way of doing so is via implementation in finite precision. While ad hoc trial-and-error techniques in finite precision deep learning abound, theoretical guarantees on network accuracy are elusive. The work presented in this dissertation builds a theoretical framework for the implementation of deep learning in finite precision. For inference, we theoretically analyze the worst-case accuracy drop in the presence of weight and activation quantization. Furthermore, we derive an optimal clipping criterion (OCC) to minimize the precision of dot-product outputs. For implementations using in-memory computing, OCC lowers ADC precision requirements. We analyze fixed-point training and present a methodology for implementing quantized back-propagation with close-to-minimal per-tensor precision. Finally, we study accumulator precision for reduced precision floating-point training using variance analysis techniques.

    We first introduce our work on fixed-point inference with accuracy guarantees. Theoretical bounds on the mismatch between limited and full precision networks are derived. Proper precision assignments can be readily obtained from these bounds, and weight-activation as well as per-layer precision trade-offs are derived. Applied to a variety of networks and datasets, the presented analysis is found to be tight to within 2 bits. Furthermore, it is shown that a minimum precision network can have up to ~3.5× lower hardware complexity than a binarized network at iso-accuracy. In general, a minimum precision network can reduce complexity by up to ~10× compared to a full precision baseline while maintaining accuracy. Per-layer precision analysis indicates that the precision requirements of common networks vary from 2 bits to 10 bits to guarantee an accuracy close to the floating-point baseline.

    Then, we study DNN implementation using in-memory computing (IMC), where we propose OCC to minimize the column ADC precision. The signal-to-quantization-noise ratio (SQNR) of OCC is shown to be within 0.8 dB of the well-known optimal Lloyd-Max quantizer. OCC improves the SQNR of the commonly employed full-range quantizer by 14 dB, which translates to a 3 bit reduction in ADC precision. The input-serial weight-parallel (ISWP) IMC architecture is also studied. Using bit-slicing techniques, significant energy savings can be achieved with minimal accuracy loss. Indeed, we prove that a dot-product can be realized with a single memory access while suffering no more than a 2 dB SQNR drop. Combining the proposed OCC and ISWP noise analysis with our DNN precision analysis, we demonstrate a ~6× reduction in the energy consumption of DNN implementations at iso-accuracy.

    Furthermore, we study the quantization of the back-propagation training algorithm. We propose a systematic methodology to obtain close-to-minimal per-layer precision requirements that guarantee statistical similarity between fixed-point and floating-point training. The challenges of quantization noise, inter-layer and intra-layer precision trade-offs, dynamic range, and stability are jointly addressed. Applied to several benchmarks, fixed-point training is demonstrated to achieve high fidelity to the baseline, with an accuracy drop no greater than 0.56%. The derived precision assignment is shown to be within 1 bit per tensor of the minimum. The methodology is found to reduce the representational, computational, and communication costs of training by up to 6×, 8×, and 4×, respectively, compared to the baseline and related works.

    Finally, we address the problem of reduced precision floating-point training. In particular, we study accumulation precision requirements. We present the variance retention ratio (VRR), an analytical metric measuring the suitability of accumulation mantissa precision. The analysis expands on concepts employed in variance engineering for weight initialization. An analytical expression for the VRR is derived and used to determine accumulation bit-width for precise tailoring of computation hardware. The VRR also quantifies the benefits of effective summation reduction techniques such as chunked accumulation and sparsification. Experimentally, the validity and tightness of our analysis are verified across multiple deep learning benchmarks.
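    A minimal numerical sketch of the clipping idea behind OCC (an illustration only, not the dissertation's closed-form criterion): dot-product outputs feeding a column ADC are modeled as Gaussian, and the clipping threshold of a B-bit uniform quantizer is swept to find the MSE-optimal (maximum-SQNR) value, which is then compared against a full-range quantizer that clips at the observed maximum.

```python
import numpy as np

def quantize(x, clip, bits):
    """Symmetric uniform quantizer with clipping threshold `clip` and `bits` bits."""
    step = 2.0 * clip / (2 ** bits)
    return np.round(np.clip(x, -clip, clip) / step) * step

def sqnr_db(x, clip, bits):
    """Signal-to-quantization-noise ratio (dB) for a given clipping threshold."""
    noise = x - quantize(x, clip, bits)
    return 10.0 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))

# Dot-product outputs are sums of many terms, so a Gaussian model is reasonable.
rng = np.random.default_rng(0)
y = rng.standard_normal(200_000)

bits = 4  # hypothetical column ADC precision
full_range_sqnr = sqnr_db(y, np.abs(y).max(), bits)   # clip at the observed maximum

# Clipping criterion (sketch): sweep thresholds, keep the SQNR-maximizing one.
clips = np.linspace(0.25, np.abs(y).max(), 200)
sqnrs = np.array([sqnr_db(y, c, bits) for c in clips])
best = clips[np.argmax(sqnrs)]
print(f"full-range SQNR = {full_range_sqnr:.1f} dB")
print(f"optimal clip ~ {best:.2f} -> SQNR = {sqnrs.max():.1f} dB")
```

    The gap between the two printed SQNR figures is the kind of improvement the abstract quantifies (14 dB, or roughly 3 bits of ADC precision, for the full-range quantizer).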

    Analytical guarantees for reduced precision fixed-point margin hyperplane classifiers

    Margin hyperplane classifiers such as support vector machines are strong predictive models that have gained considerable success in various classification tasks. Their conceptual simplicity makes them suitable candidates for the design of embedded machine learning systems. Their accuracy and resource utilization can effectively be traded off against each other through precision. We analytically capture this trade-off by means of bounds on the precision requirements of general margin hyperplane classifiers. In addition, we propose a principled precision reduction scheme based on the trade-off between input and weight precisions. Our analysis is supported by simulation results illustrating the gains of our approach in terms of reduced resource utilization. For instance, we show that a linear margin classifier with a precision assignment dictated by our approach, applied to the 'two vs. four' task of the MNIST dataset, is ~2x more accurate than a standard 8 bit low-precision implementation in spite of using ~2x10^4 fewer 1 bit full adders and ~2x10^3 fewer bits for data and weight representation.
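    As a sketch of the kind of input-versus-weight precision trade-off studied here (using synthetic 2-D data and a stand-in hyperplane rather than the MNIST task or the paper's bounds), the snippet below quantizes the inputs and weights of a linear margin classifier to different fixed-point precisions and reports the resulting accuracy:

```python
import numpy as np

def to_fixed_point(x, bits):
    """Quantize to `bits`-bit signed fixed point spanning the array's dynamic range."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

# Synthetic two-class data and a stand-in "trained" hyperplane (w, b).
rng = np.random.default_rng(1)
labels = rng.choice([-1, 1], size=4000)
X = rng.standard_normal((4000, 2)) + labels[:, None] * np.array([1.0, 0.4])
w, b = np.array([1.0, 0.4]), 0.0

def accuracy(input_bits, weight_bits):
    Xq, wq = to_fixed_point(X, input_bits), to_fixed_point(w, weight_bits)
    return np.mean(np.sign(Xq @ wq + b) == labels)

# Trade input precision against weight precision; the cost of a fixed-point
# multiplier grows roughly with the product of the two bit-widths.
for bi, bw in [(8, 8), (4, 8), (8, 4), (3, 3)]:
    print(f"input {bi} bit, weight {bw} bit -> accuracy {accuracy(bi, bw):.3f}")
```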

    Analytical guarantees on numerical precision of deep neural networks

    The acclaimed successes of neural networks often overshadow their tremendous complexity. We focus on numerical precision, a key parameter defining the complexity of neural networks. First, we present theoretical bounds on the accuracy in the presence of limited precision. Interestingly, these bounds can be computed via the back-propagation algorithm. Hence, by combining our theoretical analysis and the back-propagation algorithm, we are able to readily determine the minimum precision needed to preserve accuracy without having to resort to time-consuming fixed-point simulations. We provide numerical evidence showing how our approach allows us to maintain high accuracy but with lower complexity than state-of-the-art binary networks.
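    A toy sketch of the key observation, that per-tensor sensitivities to quantization noise fall out of back-propagation, on a hypothetical two-layer ReLU network with random weights (this is illustrative, not the paper's exact bound):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical two-layer ReLU network with fixed random weights.
W1 = 0.3 * rng.standard_normal((16, 8))
W2 = 0.3 * rng.standard_normal((8, 4))

def sensitivities(x):
    """Return mean |d(top score)/d(tensor)|^2 per tensor, via back-propagation."""
    z1 = W1.T @ x
    a1 = np.maximum(z1, 0.0)
    y = W2.T @ a1
    dy = np.zeros_like(y)
    dy[np.argmax(y)] = 1.0            # differentiate the winning class score
    da1 = W2 @ dy                     # sensitivity to hidden activations
    dz1 = da1 * (z1 > 0)              # back through the ReLU
    dW2 = np.outer(a1, dy)            # sensitivity to second-layer weights
    dW1 = np.outer(x, dz1)            # sensitivity to first-layer weights
    return [np.mean(g ** 2) for g in (dW1, dW2, da1)]

# Average over inputs: a tensor with a larger "noise gain" perturbs the output
# more per unit of quantization noise, so it needs more bits.
gains = np.mean([sensitivities(rng.standard_normal(16)) for _ in range(512)], axis=0)
for name, g in zip(("W1", "W2", "hidden activations"), gains / gains.max()):
    print(f"relative noise gain of {name}: {g:.3f}")
```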

    Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training

    Data clipping is crucial in reducing noise in quantization operations and improving the achievable accuracy of quantization-aware training (QAT). Current practices rely on heuristics to set clipping threshold scalars and cannot be shown to be optimal. We propose Optimally Clipped Tensors And Vectors (OCTAV), a recursive algorithm to determine MSE-optimal clipping scalars. Derived from the fast Newton-Raphson method, OCTAV finds optimal clipping scalars on the fly, for every tensor, at every iteration of the QAT routine. Thus, the QAT algorithm is formulated with provably minimum quantization noise at each step. In addition, we reveal limitations in common gradient estimation techniques in QAT and propose magnitude-aware differentiation as a remedy to further improve accuracy. Experimentally, OCTAV-enabled QAT achieves state-of-the-art accuracy on multiple tasks. These include training-from-scratch and retraining ResNets and MobileNets on ImageNet, and SQuAD fine-tuning using BERT models, where OCTAV-enabled QAT consistently preserves accuracy at low precision (4 to 6 bits). Our results require no modifications to the baseline training recipe, except for the insertion of quantization operations where appropriate. Published as a spotlight paper at ICML 2022.
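    A sketch of the recursion described in the abstract, written as a fixed-point iteration on the MSE-optimality condition for a B-bit symmetric uniform quantizer (the function name and starting point are illustrative; see the paper for the exact Newton-Raphson derivation and for magnitude-aware differentiation):

```python
import numpy as np

def octav_clip(x, bits, iters=20):
    """Estimate an MSE-optimal clipping scalar for `bits`-bit symmetric uniform
    quantization by iterating the optimality condition (sketch of OCTAV)."""
    ax = np.abs(np.asarray(x, dtype=np.float64)).ravel()
    q = 4.0 ** (-bits) / 3.0              # quantization-noise factor for |x| <= s
    s = ax.mean() + 3.0 * ax.std()        # any positive starting point works
    for _ in range(iters):
        clipped = ax > s
        # s = E[|x| ; |x| > s] / ( q * P(|x| <= s) + P(|x| > s) )
        s = ax[clipped].sum() / (q * (~clipped).sum() + clipped.sum() + 1e-12)
    return s

# Example: a heavy-tailed tensor, as is typical of weights and gradients.
rng = np.random.default_rng(3)
t = rng.standard_t(df=3, size=500_000)
for b in (4, 6, 8):
    print(f"{b}-bit clip ~ {octav_clip(t, b):6.3f}   (max |x| = {np.abs(t).max():.1f})")
```

    In a QAT loop, a recursion of this kind would be re-run per tensor at each iteration to set the quantizer range before applying the quantization operation.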